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Statistical methodology is proposed for comparing unlabeled mar- 
ked point sets, with an application to aligning steroid molecules in 
chemoinformatics. Methods from statistical shape analysis are com- 
bined with techniques for predicting random fields in spatial statis- 
^^ ' tics in order to define a suitable measure of similarity between two 

^^ ' marked point sets. Bayesian modeling of the predicted field overlap 

between pairs of point sets is proposed, and posterior inference of the 
alignment is carried out using Markov chain Monte Carlo simulation. 
'^ , , By representing the fields in reproducing kernel Hilbert spaces, the 

degree of overlap can be computed without expensive numerical in- 
tegration. Superimposing entire fields rather than the configuration 
^ , matrices of point coordinates thereby avoids the problem that there 

OO ' is usually no clear one-to-one correspondence between the points. 

0^ I In addition, mask parameters are introduced in the model, so that 

>— ^ ■ partial matching of the marked point sets can be carried out. We 

also propose an adaptation of the generalized Procrustes analysis al- 
fT^ , gorithm for the simultaneous alignment of multiple point sets. The 

c7^ ' methodology is illustrated with a simulation study and then applied 

Cn ' to a data set of 31 steroid molecules, where the relationship between 

shape and binding activity to the corticosteroid binding globulin re- 
ceptor is explored. 



$H ■ 1. Introduction. In many application areas it is of interest to compare 



marked point sets, where measurements (marks) are available at various 
point locations, and often the configurations of points are unlabeled in the 
sense that there is no natural correspondence between the points in each 



Received October 2010; revised May 2011. 

^Supported by an EPSRC/University of Nottingham studentship and a Lever hulme 
Research Fellowship. 

Key words and phrases. Bioinformatics, chemoinformatics, kriging, Markov chain 
Monte Carlo, reproducing kernel Hilbert space, Procrustes, shape, size, spatial, steroids. 



This is an electronic reprint of the original article published by the 

Institute of Mathematical Statistics in The Annals of Applied Statistics, 

2011, Vol. 5, No. 4, 2603-2629. This reprint differs from the original in pagination 

and typographic detail. 

i 



2 I. CZOGIEL, I. L. DRYDEN AND C. J. BRIGNELL 

configuration. The task of comparing unlabeled marked point sets fias been 
of recent interest in statistical shape analysis, for example, Green and Mardia 
(2006), Dryden, Hirst and Melville (2007) and Schmidler (2007). As opposed 
to these previous approaches, our method does not aim to model point 
correspondences. Instead, the objects are compared by assuming a common 
underlying reference field which gives rise to the spatial distribution of the 
marks. 

One example where the alignment of unlabeled marked point sets is of 
practical importance comes from the fields of structural bioinformatics and 
chemoinformatics, where it is of great interest to align molecules. However, 
the task is often very difficult. The motivating application in this paper is 
a data set comprising 31 steroid molecules which bind to the corticosteroid 
binding globulin (CBG) receptor. For each molecule, the Cartesian coordi- 
nates of the atom positions, as well as the associated van der Waals radii, and 
the partial atomic charge values at the atom positions are provided. Here 
the marks at each point (atom) are either the van der Waals radii or the par- 
tial charges. The steroids fall into three activity classes with respect to their 
binding activity to the CBG receptor [Good, So and Richards (1993)], and 
the main objective in this application is to compare the molecules in order 
to obtain the common features in each of the three groups and to examine 
whether these features are associated with the type of binding activity. 

We consider a simple model under which spatial prediction of a reference 
field is carried out using the observed marks in each configuration. A mea- 
sure of similarity between the two predicted fields is then used to describe 
the similarity, taking into account an unknown transformation between the 
point sets which gave rise to the actually observed point coordinates. The 
parsimonious model does not attempt to model accurately all aspects of 
the molecules in our application. It is rather used to develop a Bayesian 
algorithm based on Markov chain Monte Carlo (MCMC) simulations for 
matching, which is demonstrated to work well in our applications. In this 
setting it is also possible to introduce additional parameters (mask vectors) 
which allow for the fact that only part of the point sets may be similar. 
By determining and aligning only the similar parts of the given point sets, 
a meaningful comparison can be carried out. 

In Section 2 we motivate and describe our newly developed measure of 
similarity for comparing unlabeled marked point sets. The Bayesian frame- 
work for the pairwise alignment and similarity calculation is introduced in 
Section 3. An extension of this methodology to the simultaneous alignment 
of multiple point sets is described in Section 3.3. In Section 4 we carry out 
simulation studies in two and three dimensions to validate our method. In 
Section 5 we apply our methods to the steroids data and assess the results 
with respect to their chemical relevance. Finally, Section 6 concludes the 
paper with a discussion. 
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2. Similarity measures using spatial prediction. 

2.1. Random field model. The starting point for our model is an un- 
derlying reference random field {Z(x):x € M*"} which is assumed to be 
second-order stationary with a constant mean E{Z{x)) = fi and a positive 
definite covariance function cr(h) = Cov(Z(x), Z(x -|- h)) = (t(— h). Consider 
two marked point sets A and B, say, which are z-values at point locations 
in the field given by x^^ e M"*, i = l,...,kA, and xf G M™, j = 1, . . . , /c^. In 
a vector representation, the marked point sets A and B can therefore be 
written as 

Z^ = {zHxn...,zH^'^J}, Z^ = {z^{xr),...,Z^{xg)}, 

respectively. Note that the relative position of A and B as given by {x'^^^, . . . , 
x'A^} and {x'P, . . . ,x'^} is special because the spatial distribution of the 
marks within each point set and also the spatial distribution of the joint set 
of marks z^^ = {z^{x'/'), ..., z^{x'i^^),z^{x'P), ..., z^{x'^^)} directly reflect 
the properties of {Z{x) :x G M"*}. In that sense it can be regarded as the 
true relative position. 

However, in real-life data sets the given marked point sets are often 
provided in arbitrary locations, that is, before being recorded each point 
set is transformed to new locations x^ = $a(x^ ), ^ = 1> • • • > ^A, and x^ = 
$B(xf ), J = 1, . . . , A;b, where $a : K"" ^ M™ and ^b : M"' ^ M™ are unknown 
transformation functions which are assumed to be 1-1 and onto. Hence, the 
inverse transformations <I>^ and $3 exist and satisfy <^^ {$a(x)} =x = 

$a{^X^(x)} and $b^{$b(x)} =x = $B{$inx)}, respectively. 

The basic inference problem we consider in this paper can now be for- 
mulated as follows: if we are given the two marked point sets A and B 
with z recorded at locations {x]^,...,x^ } and z^ recorded at locations 
{x^, . . . ,x^ }, can we measure how similar they are, taking into account the 

unknown transformation $ = <I>a^b from B to Al The method involves 
aligning the point sets by estimating the transformation parameters in $. 

The particular choice of the set of potential transformations will depend 
on the application. In our case the marked point sets are the partial charges 
or the van der Waals radii of the steroid molecules which are recorded in 
arbitrary positions and orientations. As steroid molecules in general are rigid 
(the word is derived from "stereos" = "rigid" in Greek) , we consider the rigid 
body transformations of translation and rotation, that is, 

(1) $(x)=rx + 7, rGS0(m),7GM™, 

where the space of special orthogonal matrices SO(m) contains the rotation 
matrices which satisfy F F = FF = !„ and |F| = 1. Other more compli- 
cated transformations could be used, such as when more dynamic aspects 
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of molecule shape need to be taken into account. For example, movement 
around rotatable bonds could be added if desired in other applications. The 
choice of n and cr(h) in the random field will also depend on the applica- 
tion. 

In order to estimate the transformation parameters in $, we first con- 
sider predicting the underlying reference field Z{x.) using each point set 
separately. A similarity measure is then defined which measures how close 
the two predicted fields are in a certain relative position. Finally, we can 
estimate the unknown transformation by maximizing the similarity measure 
or, alternatively, by developing a statistical model based on the similarity 
measure. 

2.2. Kriging. In order to predict the underlying reference random field 
from each point set, we consider simple kriging [e.g., Cressie (1993), page 110] 
which assumes the mean field ^u = 0. For the steroid molecules with partial 
charge or van der Waals radius marks, it makes sense to fix /i = 0, so that 
a long way from the molecular skeleton the predicted field is zero. We will 
use a sample variogram to help suggest an appropriate covariance func- 
tion. 

Consider a general marked point set z = {z(xi), . . . , z(xfc)}. If simple krig- 
ing is used to predict the value of the underlying random field {Z{x.) : x € M™"} 
at a location of interest xq, say, a weighted average of the form Z{xq) = 
X]j=i^i-2(xi) is sought so as to minimize the prediction mean squared error 
PMSE(u) = E[(Z(xo) - ^(xo))2] with respect to the weight vector u={ui, 
. . . , Uk) . Given the observed values in z, the corresponding system of equa- 
tions has the solution u = Xl^^cr, and the predicted value for Z(-kq) is given 
by Z(xo) = cr(xo)^S~^z = u^z, where cr(xo) = (o-(xi -xq), . . . , a(xfc -xq))^ 
and (5])ij = (t(xj — Xj), 1 < i, j < A:. For a general location x this yields the 
predicted field 

k 
(2) Z(x)=z^S-V(x) = ^i(;,a(xi-x), 

1=1 

where the weight vector w = [wi, . . . ,Wk) ="^2 is optimal in terms of 
minimizing the PMSE if the underlying assumptions are met. Note that in 
some applications it may not be appropriate to assume // = 0, in which case 
one would work with the mean corrected field Z{x.) — fi, where ft is either 
known or estimated using generalized least squares from each marked point 
set. 

Using (2) and based on the observed data vectors z and z^, we can 
obtain a different prediction of the underlying reference random field from 
each of the two marked point sets A and B, and the resulting predicted 
fields .^a(x) and Zb{x) then need to be compared. 
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2.3. Function similarity and the Kernel Carho index. In order to mea- 
sure the similarity of the predicted fields .^a(x) and Zb(x), we require a met- 
ric space where the notion of similarity can be defined by means of the corre- 
sponding inner product. A commonly used metric space for functions is the 
space of Lebesgue square- integrable functions L2, where the inner product 
has the form 



dx. 



(3) (/,g)i, = y"/(x)g(x) 

Based on (3), an intuitive measure of similarity between two functions / 
and g can be formulated which does not depend on the scales of / and g, 
that is, 

p //(x)ff(x)c?x (/,9)l2 



{ffi^yd^y/mgi^yd^y/^ {{fJ)LA9,9)L,y/'' 

and so Rfg = 1 if f = eg, where c > is a positive constant, and i?fg = —1 
if c < 0. Note that i?fg is a generalization of Pearson's correlation coeffi- 
cient for comparing two functions. Also note that, in general, calculation 
of Rig would involve numerical integration over the domain, which may be 
computationally demanding. 

An alternative metric space for functions is a reproducing kernel Hilbert 
space (RKHS) that, for a given reproducing kernel, can easily be constructed 
and is much simpler and quicker to use in practice. This alternative is very 
useful for our model because the covariance function a of the reference ran- 
dom field can be viewed as a reproducing kernel on M™" x M™ due to the 
properties of a general covariance function (e.g., symmetric and positive def- 
inite). Hence, the corresponding RKHS exists [Aronszajn (1950)] and can be 
written as T-L^ = {/|/(x) = X]j=i aiO'(x^ ~ x)}. In this space the inner prod- 
uct of /(x) = ^fi, aia(xf - x) G ^, and ^(x) = E^^, Pja{^f - x) G 7^, 
has the form 

which can be evaluated without expensive numerical integration. 

Note that we can view the kriging predictor (2) as a member of Tia, and, 
hence, we can use the RKHS inner product (•, •)-^^ to measure the similarity 
between the predicted fields of A and B. Let ^a(x) = Yli=i ^i ^(^i ~ ^) ^^'^ 
Zb(x) = '^j=iwfo'{^{x^) — x) denote the predicted fields of the marked 

point sets A and B in the relative position defined by $ = ^a^b ■ "^^^ 
similarity measure we propose in this paper has the form 

(4) Cab(0) = -^— — T^— — , 
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where ||Z]v[||^ = {ZmjZm)^.^ (^ ^ {^'-^})i ^^^d denotes the parameter 
vector of the unknown transformation $. The numerator term measures the 
"overlap" of the fields (in a certain relative position) , whereas the denomi- 
nator is a transformation invariant normalizing constant which ensures that 
Cab{4>) £ [~1) !]• Note that (4) can also be interpreted as the cosine of the 
angle between the two predicted fields in a certain relative position. 

We shall call the above similarity function the "Kernel Carbo function," 
as it is a modification of a similarity function proposed by Carbo, Leyda 
and Arnau (1980) in the context of field-based molecular alignment. The 
fields considered in that original paper are the electron densities of the two 
molecules under study, and the similarity was defined in terms of the L2 
inner product given in (3). As both fields in our setting are members of the 
RKHS 7^0-, the Carbo similarity function can be "kernelized" by replacing 
{•,-)l2 with (•,-)-Hct, which has the advantage that calculating (4) does not 
require evaluation of overlap integrals over M*" for any choice of positive 
definite covariance function. 

For the reproducing kernel we shall consider the istropic Matern co- 
variance function, where the covariance of the field between any pair of 
points x,y is given by 

, , 1 /2l/V2||x-y||V /2l/V2||x. 

(5) ^(x-y) = ;^-TT;73 ^-^^ ^J 



2''-^r{u)\ p J \ p 

This provides a fiexible family of stationary covariance functions [Stein 
(1999), page 31]. With this particular parameterization [e.g., Handcock and 
Wallis (1994)], p is a range parameter and i' determines the smoothness 
of the random field. Moreover, Ky{-) is the modified Bessel function of the 
third kind of order v and r(-) is the Gamma function. Note that j^ — )• 00 
corresponds to the Gaussian covariance function 

(6) cr(x-y)=exp{-||x-y||Vp^}, 

and in this particular case the L2-Carbo index of our predicted fields could 
be calculated analytically. 

Optimizing (4) with respect to the transformation parameters yields the 
"Kernel Carbo index" 

(7) C(Ai?) = supCAB(<^) = sup- ^^a,^b)h. 



(t> 4> WZaWh^WZbWhct 

in which configuration B is transformed (by the relative transformation 
function $) to be as similar as possible to configuration A. In the case (1) 
where the rigid body transformations in M"* are considered, the parameter 
vector cj) contains m{m — l)/2 Euler angles for rotation and m translation 
parameters, and in this case the Kernel Carbo index is invariant under the 
rigid body transformations of A and B. 
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Note that the optimization in (7) is not straightforward in practice due 
to local maxima. As an approximation to using the Kernel Carbo index 
in (7), we will therefore propose a Bayesian model and find the value of the 
similarity index (4) at the maximum a posteriori (MAP) estimates of the 
transformation parameters. Also note that, in situations where a dissimilar- 
ity rather than a similarity measure is required, (4) can be uniquely mapped 
into the appropriate codomain using 

and applying the same transformation to (7) or its MAP equivalent then 
yields a transformation invariant dissimilarity index between two marked 
point sets. 

2.4. Masks. In many applications it is of interest to match parts of ob- 
jects rather than the entire configurations. Our steroids application is one 
such example because only a part of each molecule may fit into the bind- 
ing pocket of the common receptor and is hence relevant for the binding 
mechanism. As a tool for matching only parts of the given configurations, 
we consider a set of masks (indicator parameters) which signify if individual 
points are included in the predicted field or not. The masks therefore allow 
for the possibility that only parts of the structures match, whereas other 
parts may have been generated by different underlying reference fields or 
may be largely affected by noise. 

From now on we will just consider rigid body transformations between 
the point sets, with rotation matrix T and translation vector 7, although, 
as mentioned above, the approach can be extended to other transformations. 

Let Am = (Af , . . . , A^^)^ be the mask vector for point set M {M £ {A, B}). 
Each entry of the mask vector is an indicator function, that is, A^ G {0, 1} 
which determines if the ith point of set M is considered to contribute to the 
matching parts (A^ = 1) or not (A^ = 0), i = 1,. . . , kM- Taking the mask 
vector into account, the predicted version of the common reference field ba- 
sed on M then has the form Zm(x; Am) = Sj- am=i ^I^('^m)<7(x^ — x), and 

the resulting partial Kernel Carbo function for two masked fields .^a(x; Aa) 
and Zb(x;Ab) in a certain relative position becomes 






(9) CAB(r,7,AA,AB)= Y. Z2 *f(AA)*;(AB)a(xf-(rxf+7)), 



where the tilde indicates that the kriging weights are normalized by the cor- 
responding term in the normalizing constant, that is, i()^(Am) = w^(Am)/ 
-^m(Am)5 with A'm(Am) = ||^m(x; Am)||-h<t- The partial Kernel Carbo index 
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can then be obtained by maximizing (9) over the transformation and mask 
parameters. 

Optimizing the similarity measure (9) over aU possible subsets is very 
challenging due to the combinatorial nature of the search space. Instead we 
use a Bayesian model to obtain the MAP estimates of the similarity transfor- 
mations and masks and then evaluate (9) at the MAP, which approximates 
the maximization of (9). Rather than trying to develop a realistic proba- 
bilistic model for the data, we therefore view the Bayesian model and the 
resulting MCMC scheme as a practical approach for generating an algorithm 
to match two spatial point patterns. Also, apart from transforming the prob- 
lem into a more tractable one, the Bayesian setting allows the introduction 
of prior information about the parameters which will be useful, for example, 
to prevent excessive masking. 

3. Bayesian pairwise alignment of marked point sets. 

3.1. Likelihood. With the assumption that the similar parts of the two 
point sets are noisy pointwise observations of the same underlying refer- 
ence field, we define the likelihood for the two marked point sets z^ = 
{z^{xf ),..., z^{x^ J} and z^ = {z^{xf), . . . ,z^{x^^)} in the relative po- 
sition defined by T and 7 as 

(10) L(z^, z^|0,7, Aa, Ab, t) oc r exp(-ri:>AB(r,7, Aa, Ab)), 

where 6 denotes the vector of the Euler angles which specifies a rotation 
matrix T{9), 7 denotes a displacement vector between A and B, t £ M"*" is 
a precision parameter, and L'AB(r,7, Aa, Ab) is the dissimilarity function 
based on (8) and (9). Here, the mask vectors play a similar role as the label- 
ing matrices in Green and Mardia (2006), Dryden, Hirst and Melville (2007) 
and Schmidler (2007), except in our framework there is no need to estab- 
lish correspondences between points in A and B. Instead, the mask vectors 
are defined separately for each point set. The pairwise correspondence does 
not need be estimated because all possible pairs of atoms are considered in 
the model, and the pairs are weighted according to how far apart they are 
during the matching. 

Note that if r is fixed, the likelihood is maximized at the same rotation, 
translation and mask parameter estimates that give the maximum value of 
the partial Kriged Carbo index (9). This, and the fact that it performed well 
in pilot simulations, provides the motivation for the use of this likelihood. 
Other choices include the half-normal likelihood 

L{A, B\e, 7, Aa, Ab, r) a r^/^ exp{-TDl^{T,-f, Aa, Ab)), 

which is less accommodating of outliers but might be preferable in some 
situations. 
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3.2. Prior distributions and posterior sampling. We do not have any 
prior information about the rigid body parameters 6 and 7 so that they 
are treated as uniformly distributed on SO(m) and on a large bounded re- 
gion in M™, respectively. The uniform distribution on SO(m) is determined 
by the probability measure which is invariant under the group action. In the 
two-dimensional case, fu{^) ^ 1- For ?7^ = 3, the appropriate density with 
respect to the Lebesgue measure depends on the parametrization of S0(3), 
and in this paper we use the Euler angles in the so-called x-convention where 

(cos ^3 
-sin^s 

In that case, fu{^) ^ cos(^2) and with the domains — vr < 61,63 < vr and 
—tt/2 < ^2 < 7r/2, every T G S0(3) is uniquely determined apart from a sin- 
gularity at 62 = — 7r/2. 

To prevent the situation where only very few points are used in the field 
comparison, we introduce a (fixed) penalty parameter C ^ and a (fixed) 
interaction parameter C/ ^ to define the joint prior density of the mask 
vectors as 




7r(AA,AB|C,C/)ocC^^^^+S»^?+C, ' 



E.A.|Af-AA|+j:_B lAf-Afl 



7 I Z_/ .h5 , I z J 



where i ~ j means that points i and j are neighbors within M (M E {A, B}), 
for example, if ||x^ — x^|| < 6. Note that the dimensions of Aa E {0, 1}*^'^ 
and Ab G {0, 1}^^ are fixed. The penalty parameter (^ inherently comprises 
prior assumptions about the extent of the matching parts of A and B, with 
higher (^ indicating more prior matching points. Also, if the interaction pa- 
rameter C/ is strictly greater than 1, this indicates clustering so that nearby 
points within a point set are expected to be included (or excluded) together 
in the matching. Thus, a large positive C/ would be used when we wish to 
encourage contiguous regions to be included in the matching, although we 
shall use Ci = 1 in our applications. 

With the further assumptions that the precision parameter is Gamma 
distributed a priori, that is, r~r(a,/3), and that all unknown parameters 
are independent a priori, the joint posterior conditioned on the given data 
is 

7r(0,7,AA,AB,r|z^,z^,a,/3,C,C/) 

ocT"exp{-T(Z)AB(r,7,AA,AB) + /3)}-^(AA,AB|C,O)-/c/(0). 

Note that this can be regarded as a mixture model over {0,1}'^^ x {0,1}'^^. 

We use MCMC to sample from the posterior distribution. The resulting 

point estimates for the rigid body parameters and the mask vectors can 

then be substituted into L'AB(r,7, Aa, Ab) to yield a point estimate of the 
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dissimilarity measure 

(11) DAB = l?AB(f,7,AA,AB). 

In the MCMC scheme, r is updated with a Gibbs step. Updated versions of 
the other parameters are obtained in four blocks, each using a Metropolis- 
Hastings step. For the rigid body parameters, we use independent normal 
proposals, and a proposal distribution for the masks vectors Aa and Ab can 
be obtained by choosing an entry at random and then switching its value 
from zero to one or vice versa. 

The algorithm we use ensures that the defined Markov chain is irreducible, 
aperiodic and positive recurrent, and, hence, after a large number of itera- 
tions the simulated value is approximately generated from the posterior dis- 
tribution. Due to the symmetry of the proposal distributions, convergence to 
and sampling from the limiting distribution in practice thereby results in an 
approximate stochastic minimization of the dissimilarity term, and this be- 
havior can be emphasized by choosing a prior distribution with a large mean 
for T. In fact, if one is mainly interested in obtaining point estimates of the 
model parameters which provide a good superposition, simulated annealing 
[Kirkpatrick, Gelatt and Vecchi (1983)] can be employed so that the MCMC 
algorithm simulates from 7r(0,7, Aa, Ab,t|A, S,a,/3,C,C/)^'^, where T > 
is slowly reduced deterministically. 

As with any MCMC scheme for a complicated high-dimensional problem, 
there is a possibility that the chain will become stuck in a local region of 
maximum posterior probability, and our application is no exception. Hence, 
judicious use of proposal distributions is required to escape such regions, for 
example, the use of occasional large proposal variances. 

Note that the partial Kriged Carbo index and the Bayesian model are 
symmetric in terms of which point set is denoted as A and which point 
set is denoted as B. However, for a practical implementation one of the 
points sets is chosen as B and transformed to be as close as possible to 
the other point set A. As our method is simulation based, slightly different 
estimates will be obtained in matching A to B and then B to A. Hence, in 
our application we carry out both matches and then take an appropriately 
symmetrized average of the estimated distance measures, for example, their 
geometric mean. 

3.3. Multiple alignment. In the multiple alignment problem, the objec- 
tive is to simultaneously superimpose a set of n marked point sets Mi , . . . , M„. 
Previous approaches to this problem include Dryden, Hirst and Melville 
(2007) and Ruffieux and Green (2009). Here, we adapt the generalized Pro- 
crustes analysis (GPA) algorithm for discrete landmark data [e.g., Dryden 
and Mardia (1998), page 90] to our field-based approach. In the classical 
GPA context it is of interest to find an alignment of the given objects which 
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minimizes the sum of their pairwise squared distances. A similar goodness- 
of-fit criterion for the multiple superposition of n predicted masked fields 
can be formulated in terms of their overall similarity as 

n—l n f 



= li=j+l ^/.AJ = l,/.;^J=i 



(r,4+7,))}, 



where A^ = (Af , . . . , A^) G {0, 1}^> ^', 6^ = {Oj,...,el) e M™H-i)"/2 ^nd 
7"^ = (7^", . . . ,7n) S IR™" denote the stacked vectors of the involved mask, 
rotation and translation parameters, respectively, and Fj = Ti{9i),i = 1, . . . , n. 
Moreover, A^ denotes the Zth entry of the mask vector Aj, xj is the Carte- 
sian coordinate vector of the /th landmark in the zth point set, and wl{Xi) 
denotes the corresponding normalized kriging weight. For the multiple align- 
ment of Ml, . . . ,Mn we want to maximize (12) with respect to the m^m — 
l)n/2 + mn + '^^ ki parameters. 

Define a "normalized mean field" of all but the ith. point set as 

%(x;A(i),6>(,),7(,)) = — y^ Y^ wj{Xj)a{(Tj^j+-fj)-^), 

where 0j) = (0f,...,Ci,d'---'^n),75) = (7r,-.-,7f_i,7f+i,---,7D 
and A(j) = (Ai ,...,Aj_i, Aj_^i,...,A„) and let C(j)(0i,7j, Aj; 0(j),7(j), A(j)) 

denote the inner product of Z(j)(x; A(j),0(j),7(j)) and Zi{x]Xi,di,^j^). It can 
be seen that (12) has the property 

1 " 
C{(^,1,^) Of ~ / ,C(i){6i,fi,Xi;6^i),fi^i),X(^i)). 
n . 

Due to this decomposition, the optimization can be carried out stepwise 
by maximizing C(i)(0i,7j, Ai;0(j),7(j), A(j)) in turn. The vectors 0(i), 7(j) 
and A(j) are thereby kept fixed at each step. 

An optimization of the overall Kernel Carbo index C(0,7, A) is numeri- 
cally difficult. However, we can replace it by posterior inference within the 
MCMC scheme developed for the pairwise alignment. As before, the choice 
of the prior distribution for the precision parameter r determines how much 
the algorithm pushes the estimates of the other model parameters toward 
the posterior mode. An iterative stochastic optimization of the normalized 
fields Zi{x) can therefore be formulated by employing a "large precision 
version" of the MCMC algorithm for the pairwise alignment and then using 
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Algorithm 1 Stochastic field GPA for unlabeled marked point sets 

1: choose the smallest point set as reference and superimpose the n — 1 remaining 
configurations onto it 

2: define d <— doj where do > tol and tol is a positive tolerance threshold 

3: calculate the multiple Carbo index C{6,j,X) 

4: while d > tol do 

5: for i in (1 : n) do 

6: using the current parameter values for rotation, translation and mask 

vectors, calculate a normalized mean field Z(i) (x) omitting the ith config- 
uration 

7: based on the dissimilarity D/^\{9i,^j^^\i), superimpose the ith predicted 

field onto Z(j')(x); Z(,;)(x) thereby takes the role of the reference field and 
A(i), 0(i) and 7(j-) are treated as fixed 

8: record the MAP estimates for position and mask of the ith configuration 

9: end for 

10: calculate the updated Carbo index C*(0,7, A) 

11: d^C*(6>,7,A)-C(0,7,A) 
12: C(0,7,A)^C*(0,7,A) 
13: end while 

the obtained MAP estimates to determine a new mean field. This procedure 
will in practice decrease C(0,7,A) at every step and can be repeated until 
a convergence criterion is met. 

Our field GPA algorithm is displayed as Algorithm 1. As the objective of 
the multiple alignment of the given marked point sets is to find the features 
common to all or most of the objects, the algorithm superimposes each point 
set on the smallest (in terms of the number of points) one in the data set as 
a first step. Contrary to the pairwise alignment which started at a random 
place in the parameter space, this initialization will be close to the global 
optimum which justifies the use of the large prior mean for the precision 
values. All the methods described in this paper have been implemented 
in R [R Development Core Team (2011)], and the code can be found in the 
supplementary materials [Czogiel, Dryden and Brignell (2011a)]. 

Note that the multiple alignment method assumes a common underlying 
reference field for all point sets. However, in our steroid application the 
molecules may exhibit different binding mechanisms even with the same 
receptor. In this case, several reference fields could underlie the matching 
parts of the molecules. As we discuss below, we therefore consider distinct 
sub-groups of molecules (e.g., based on chemical properties or from cluster 
analysis) and then look for common reference fields in various subgroups. In 
other applications, a similar subgroups based approach may also be suitable. 
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Fig. 1. Example of an underlying reference field and two sampling schemes: the underly- 
ing reference field (middle) is a realization of a zero-mean isotropic Gaussian random field 
with a Matern covariance function (v = 1 and p = 0.2j. The other plots show two sampling 
schemes for B*"'"'' (hig circles) and ^'■^"'^ (small circles): ub ~ nx = 80 and k— 1 on the 
left-hand side and ub = ua = 40 and k = 4 on the right-hand side. The dots correspond to 
the 961 possible point locations. 



4. Simulation studies. 



4.1. Simulation of marked point sets in two dimensions. We first carry 
out a two-dimensional simulation study to illustrate the methodology and 
examine the performance of the algorithms for different choices of parame- 
ters. We simulate marked point sets A = {z {:x.i),. . . ,z {'k^)} and B = 

{z (x^),...,z (x^ )} which share a common underlying reference field. 
As a reference field, we use a realization of a zero-mean Gaussian ran- 
dom field with an isotropic Matern covariance function defined on a grid 
of 961 regularly spaced points yj within the unit square, that is, we simu- 
late from Z = (Z(yi), . . . , ^(ygei))^ ~ N(0, S), where ^ij = a{\\y, - yj\\) is 
given in (5). For our simulations we use p = 0.2 and v = \. Figure 1 (middle) 
shows a realization z of Z. 

Let A = {^true^^cont| ^^^^ ^ ^ |^truc^ ^cont|^ ^^^^.^ u^^^^„ ^jgnotes the 

part of each point set which stems from the underlying reference field z and 
"cont" denotes the contaminated part. The term "contaminated" refers to 
the points which do not follow the field model well and so will not be helpful 
in the alignment. Hence, the contaminated points should be masked in the 
matching algorithm. 



We obtain B "^ by randomly choosing k^^"^ entries ii. 



■ ; ^k^J 



from z 



and adding independent Gaussian noise with standard deviation as to the 
corresponding marks. For B^°^^, feg"^* = k^ — /cg"^ locations on the grid are 
chosen at random and the corresponding marks are random values from 
a uniform distribution on [— c, c]. To obtain A^'^'^^, we introduce a nearness 
parameter k S N and define a set of grid points Z/Z^ as the union of neigh- 



borhoods around the points x.^ 



1 k 

1, . . . , «,g 



true\ 



where each neighborhood 
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contains the vertically, horizontally and diagonally adjacent grid points in 
a {2k + 1) X {2k + l)-box around the corresponding xf. The parameter k 
therefore measures the nearness between points in terms of grid point loca- 
tions rather than Euclidean distance which is further demonstrated in Fig- 
ure 1. The point locations x;^ (i = 1, . . . , A;^'^'^) which belong to the matching 
part of A are then chosen at random from U^^ and A*''"° is obtained by adding 
independent Gaussian noise with standard deviation a^ to the corresponding 
marks z{xf) {i = l, . . . , kf"")- Finally, the k"^""^ = kx- k^'''' points in A™"^* 
are obtained in the same way as the contamination points in B. 

Note that this simulation scheme does not create pairwise correspondences 
between points in A^'^^^ and B^^^°. Although we have used a nearness crite- 
rion in our simulation method, we have not estimated point correspondences 
in the course of the MCMC algorithm. 

For our simulation study we consider three realizations of Z, and for each 
of these realizations we define 12 different pairs of marked point sets by vary- 
ing the parameters fe*™*^ = /c^™^ = fe^™'' € {40,80}, A:™"* = fe^^""* = A;™°* e 
{0.05/c*™^0.1A;*=■"^0.15A:*™^} and k € {1,4}. Moreover, we choose c = 7 and 
as = \/0.02. Generated as above, the 36 pairs A and B are recorded in the op- 
timal relative position, and the optimal mask vectors are A a = (irtruc, OTcont) 

and Ag = (l^truciO^c 



X.cont ; 



4.2. Hyperparameter settings. For each pairwise superposition 50,000 
MCMC iterations are carried out, and each iteration contains five blocks 
updating the rotation parameter (proposal standard deviation: 0.75°), the 
translation vector (proposal standard deviation: 0.01), the precision param- 
eter and the two mask vectors, respectively. The Kernel Carbo similarity 
calculations are based on the exponential kernel, that is, (5) with v = 0.5 
(whereas i^ = 1 was used for simulating the data). Initially we use p = 0.6, 
but, within the first 1,000 iterations, this value is dynamically reduced to 
p = 0.2. This initial phase allows the algorithm to home in on a good align- 
ment even if the two points sets are far away from their optimal relative 
position. Moreover, we use /3 = 0.05 and a = 200, and these values ensure 
a desirable interaction between the obtained dissimilarity values and the pro- 
posed precision values at each iteration. We include C as a variable parameter 
in our simulation study and consider ( E {10,50,90}, and we fix ([ = 1. 

To overcome the difficulty of getting trapped in local modes, we propose 
a big change of the rigid body parameters by using increased values for 
the standard deviations of the random walk proposals every 125 iterations. 
Moreover, we restart the algorithm if the Carbo distance exceeds 0.3 after 
7,500 iterations. 

4.3. Results. For each of the 108 (36 pairs of point sets x 3 values of C) 
considered MCMC runs, the starting position of the movable point set B is 
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Fig. 2. Top row: trace plot of the rigid-body parameters (in terms of the initial relative 
position of the two points sets under consideration). Bottom row: trace plots of the precision 
parameter, the log-posterior (up to a constant) and the Knged Carbo distance. In all plots, 
every 100th simulated value is displayed. 



obtained by rotating and translating it from its original (simulated) position 
using T{6q) and 7q where Oq and 7oj (i = 1,2) are uniformly distributed 
on [—20°, 20°]^ and [—0.1,0.1], respectively. Moreover, both mask vectors 
are initiated using A,^ ~ Bernouni(0.5) {i = l,...,kM, Me {A,B}). The 
performance of each run is then assessed in terms of the root mean square 
deviation (RMSD) between the original position of B and its MAP position. 

Figures 2-4 show the typical output of a successful run. Figures 2 and 3 
indicate that the algorithm converges quickly, and from Figure 2 it can be 
seen that there is an interplay between the precision parameter r and the 
Kernel Carbo distance: until a good alignment is obtained, small distances 
lead to larger precision values which in turn yield small distances. Figure 3 
shows the trace plots for the number of points which are involved in the field 
calculation and are hence considered to belong to A^^^° and B^^^'^, respec- 
tively, and a (post burn-in) summary of the two mask vectors is displayed 
in the bottom row of Figure 3. In this particular example, the optimal mask 
vectors are Aj^ = (1|q, 0^2) {M = A, B), and the algorithm is able to recon- 
struct the mask vector very well. Figure 4 shows that the MAP position of 
the movable point set is very similar to the original one. 

We consider an alignment to be successful if RMSD < 0.1. Table 1 shows 
the percentages of success for various parameter settings in the row "set- 
ting 1," with an overall success rate of 76% out of the 108 MCMC runs. 
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As expected, the largest number of true points in combination with the 
fewest number of contamination points gives the highest success rate (89%), 
whereas the smallest number of true points in combination with the high- 
est number of contamination points gives the lowest success rate (44%). In 
combination with these extreme cases, the impact of the nearness parameter 
is most striking with 22% success for (fc*''"^ = 40, /c™''* = 6, k = 4) and 100% 
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Fig. 4. Successful alignment: the circles on the left-hand side show the initial position 
of point set B, and the circles on the right-hand side show the position of B at the MAP 
iteration. The optimal position is displayed by the crosses on both sides. The algorithm is 
able to reduce the RMSD to the optimal position from 0.479 (left) to 0.032 (right). 
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Table 1 

The percentages of MCMC runs which are regarded as a success (i.e., RMSD < 0.1^ for 

different parameter settings. The column "all" shows the percentage of successes out of 

all 108 simulations for the corresponding setting 













^true ^ go 


^true ^ 40 








All 


C = io 


C = 50 


C = 90 


fccont ^ 4 


fc"°"* = 6 


K = l 


K = 4 


Setting 1 


76 


61 


81 


86 


89 


44 


85 


67 


Setting 2 


48 


33 


47 


61 


83 


17 


52 


44 



for (/c*"^"*^ = 80, /c'^°°* = 4, K = 1). Overall, the impact of k can be summarized 
as 85% success for k = 1 and 67% success for k = 4. Interestingly, the success 
rate increases with C,. 

The above results indicate that a satisfactory alignment can be obtained if 
the number of noncontamination points is large enough to represent the main 
features of the underlying reference field and large relative to the number 
of contamination points. Moreover, especially when the number of points is 
small and the sampling of the reference field is sparse, it is important that 
the noncontamination points in A and B represent the same features of the 
reference field (which is not always the case if A;''''"^ = 40 and k = 4) . These 
trends can be emphasized by rerunning the experiments using 9 ~ C^[-60°,60°] 
and 7j ~ f^[-o.3,o.3] (* = 1)2) to obtain the starting position of B. For this 
more challenging setting ( "setting 2" ) the results are also provided in Table 1 
with similar effects but lower success rates (48% overall). 

In both settings, the performance of our alignment procedure can be im- 
proved if there are some points in A and B which can be identified as 
noncontamination points ah initio. For our examples, identifying some rele- 
vant points (on average 12 per point set) improves the overall success rate to 
93% in the first setting and to 79% in the more challenging second setting. 
In many applications it may be possible to identify some relevant points so 
that the possibility of incorporating this knowledge is a valuable tool for 
improving the alignment. 

Finally, we rerun the above experiments with different values for the range 
parameter p. For example, with p = 0.3, overall success rates of 77% in 
the first and 48% in the second setting are achieved, and for /? = 0.1, the 
corresponding success rates are 77% and 52%. These results demonstrate 
that choosing the exact covariance function for the spatial interpolation is 
not crucial for the performance of the algorithm, although performance does 
deteriorate for much larger p. For example, a leave-one-out type method 
for identifying the contamination points combined with a pooled version of 
an experimental semivariogram [e.g., Wackernagel (2003), page 47] can be 
applied to estimate an approximate covariance function which has yielded 
satisfactory results in some further experiments. 
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4.4. Three-dimensional example. We now consider a small three-dimen- 
sional simulation study which mimics the molecule alignment problem of 
Section 5. As a starting point we take the positions of the first 25 atoms of 
the first molecule in the steroid data set and generate the atom positions 
of a second "molecule" using a small perturbation (independent zero mean 
normal with standard deviation 0.01). Then a zero mean isotropic Gaussian 
random field with Matern covariance function [v = 0.5, p = 5) is simulated 
at the combined set of the 50 points. To introduce contamination points, the 
last five points in each configuration have their coordinates and marks per- 
turbed by independent A^(0,3^) errors. Finally, both molecules are centered 
and the molecules are uniformly rotated. 

For various choices of the hyperparameters /3 and C we run 100 Monte 
Carlo simulations of the Bayesian alignment procedure. Each time the two 
marked point sets and their starting relative position are generated as above. 
The parameters u = 0.5 and a = 31 are kept fixed and the range parameter 
is dynamically reduced from p = 20 to p = 5 during the matching procedure. 
Each simulation is restarted if the Kernel Carbo distance is greater than 
0.1 after 1,000 iterations (up to a maximum of 30 restarts). When the al- 
gorithm reaches 2,000 iterations the final position and the MAP position of 
the movable molecule B are recorded. 

In this situation the success of the algorithm can be measured in terms 
of the first 20 atoms of B by taking the corresponding RMSD between its 
MAP and its true position. The results of the simulation study are given in 
Table 2. As expected, the number of unmasked points in B increases with (. 
Interestingly, this consistently also leads to improved RMSD values — even 
in situations where a large value of (^ forces the algorithm to include more 
than the desired 20 points. In terms of the obtained Carbo distance, the 
impact of /3 exceeds that of C- This is also expected, as the mean of the 
full conditional distribution of the precision parameter r (cf. Section 3.2) 
decreases with /? which in turn means that the algorithm is more prone to 
accept updates with larger Carbo distances. 

Overall, this simulation study highlights that the Bayesian method works 
well in this controlled situation. 

5. Application to steroid molecules. The concept of molecular similarity 
is of great importance because similar molecules can be expected to exhibit 
a similar biochemical activity and hence drug potency. The data for the 31 
steroids considered by Dryden, Hirst and Melville (2007) are given in the 
form of a set of unlabeled, marked points where the coordinates of the points 
correspond to the atom coordinates of each molecule, and the marks are the 
partial charge values and the van der Waals radii. The data set can be found 
in the supplementary materials [Czogiel, Dryden and Brignell (2011b)]. The 
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Table 2 

Summary statistics from, the posterior distribution in the simulation study. Columns 2-6 

show the mean (and standard deviation) over 100 Monte Carlo simulations of the final 

number of unmasked points in molecule A (^ A^ ); the final number of unmasked points 

m molecule B (Yl,)^ ); the root mean square error fRMSD^; the number of new starts 

needed for the algorithm to be successful; and the Kriged Carbo distance at the final 

iteration. The last column shows the number of times out of 100 simulations that the 

algorithm failed, that is, the Kernel Carbo distance was greater than 0.1 after 1,000 

iterations for each of the maximum number of 30 restarts 
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(0.0004, 10) 


18.41 
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0.0204 
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(2.39) 


(2.13) 


(0.5207) 


(2.48) 


(0.0226) 




(0.0004, 50) 


21.16 


20.00 


0.0959 


1.56 


0.0178 
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(1.45) 


(0.97) 


(0.6521) 


(3.96) 


(0.0195) 




(0.0004, 70) 


21.66 


20.43 


0.0626 


1.12 


0.0263 







(1.53) 


(0.97) 


(1.1200) 


(2.80) 


(0.0208) 




(0.004,10) 


18.6 


17.83 


0.2193 


0.67 


0.0268 







(2.51) 


(1.90) 


(0.5492) 


(1.21) 


(0.026) 




(0.004,50) 


21.52 


20.27 


0.1018 


1.55 


0.0284 


1 




(1.46 


(1.08) 


(0.4352) 


(3.80) 


(0.0560) 




(0.004,70) 


21.58 


20.32 


0.0605 


1.19 


0.0280 







(1.42) 


(1.17) 


(0.0734) 


(3.34) 


(0.0244) 




(0.04,10) 


20.90 


19.47 


0.1306 


0.95 


0.0342 







(1.75) 


(1.60) 


(0.5884) 


(1.57) 


(0.0187) 




(0.04,50) 


23.03 


20.94 


0.0739 


1.59 


0.0485 


1 




(1.35) 


(1.19) 


(0.2748) 


(3.75) 


(0.0544) 




(0.04,70) 


23.15 


20.92 


0.0513 


2.26 


0.0472 







(1.37) 


(1.04) 


(0.0629) 


(4.41) 


(0.0258) 





Kernel Carbo index developed in this paper can therefore directly be uti- 
lized to assess the similarity between the steroids. Also, in particular, the 
assumption of a common underlying reference field seems suitable for this 
application because all molecules bind to the same receptor protein. The 
underlying reference field can therefore be interpreted as a negative imprint 
of the binding pocket of the receptor. The MCMC scheme described in Sec- 
tion 3 then determines the parts of each molecule which correspond to the 
reference field and aligns the molecules based on the similar parts only so 
that the resulting relative position should reproduce the relative binding 
positions of the steroids. 

In order to investigate the possibility of multiple binding modes (and 
hence reference fields), we shall also consider an analysis of subgroups of the 
data. In particular, we consider the three activity classes. 
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5.1. Pairwise alignment. In our application we use the Gaussian kernel 
(6) for the spatial interpolation of both the partial charge values and the van 
der Waals radii. The range parameter p for the electrostatic field is thereby 
estimated by visual inspection of a pooled empirical semivariogram function 
{pQ = 6.35), and the practical range of the steric (shape) field is taken to be 

the largest van der Waals radius in the data set {ps = 1.7/\/3 = 0.9815). 

In our simulation studies we dynamically reduced the range parame- 
ter to help the algorithm home in on a good solution. Here, we apply 
a different concept using a weighted average of the two univariate par- 
tial Carbo indices and choosing the weights dynamically as wq = ^~^ and 

ws = -^,i = l,. ■ ■ ,Nj, during an initial phase of Nj = 1,500 iterations. This 
directly mimics real-life molecular recognition where the long-range electro- 
static attraction governs the initial approach of the molecules, whereas the 
short-range repulsive steric forces gradually take over and become the chief 
manipulator for the binding affinity [e.g., Richards (1993)]. 

We use a = 31 and f3 = 0.04, and the value for the penalty parameter is 
chosen as C = 3. As standard deviations of the proposal distributions we use 
r]i = 3.25° for the rotation parameters and 7/2 = 0.25 A for the translation 
parameters, and these values ensure acceptance rates between 20% and 40%. 
The standard deviation for the rotation parameters is thereby in line with 
previously described proposal distributions for rotation parameters in the 
molecular context [e.g.. Green and Mardia (2006)]. We define the initial 
relative position of two molecules by first aligning both molecules along 
their principal axes. We then translate and rotate the movable molecule 
using 7q and T{6q), where 7oj (i = 1,2,3) and ^oi (^ = l5 2,3) are uniformly 
distributed on [—5 A, 5 A] and [—90°, 90°], respectively. 

As our MCMC algorithm is asymmetric in the sense that the relative po- 
sition of the molecules is changed by moving only one molecule whereas the 
position of the other one is fixed, we consider all 31 • 30 = 930 pairwise super- 
positions. In the majority of cases, the algorithm converges quickly and the 
trace plots show a similar behavior as the ones in Figures 2 and 3. However, 
like in the simulation studies, the algorithm can sometimes get trapped in 
a local mode (which mostly corresponds to an alignment along the wrong 
principal axes in this application) so that a restart is necessary. Figure 5 
shows an example result where the steroid aldosterone has successfully been 
superimposed onto androstanediol. 

The specific choice of Gaussian kernel is not crucial to the success of the 
algorithm for the steroids data. Similar alignments of the steroids have been 
obtained using the Matern kernel with 1/ = 0.5, ;y = 1, ;y = 2 for many exam- 
ples, but it is important that the range parameter is well chosen. We found 
that with Ps ^^ the method worked well for any choice of covariance kernel 
we used, but if ps is too large (e.g., P5 ~ 3), then the alignment is cruder and 
the algorithm is more prone to failure for any choice of covariance function. 
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Fig. 5. Tioo orthographic projections (x-y and x-z Cartesian planes) of the atoms in 
the starting position (top row) and the MAP position for the alignment (bottom row) of 
steroid molecules aldosterone and androstanediol. The carbon rings are indicated by lines 
for each molecule. The unit of all axes is Angstrom (A). 



5.2. Prior sensitivity. To investigate the sensitivity of the analysis to the 
prior distributions, we again consider the ahgnment of the two molecules al- 
dosterone and androstanediol. Table 3 shows how different values of the 
penalty parameter (" affect the empirical (post burn-in) 95% credibility in- 
tervals for the number of included atoms for both molecules based on 10,000 
iterations. As expected, the total number of included atoms increases with (^. 



Table 3 

The impact of the penalty parameter (first four rows) and a (last four rows) on the 

marginal posterior distribution of the parameters of interest. The credibility intervals are 

based on every 20th value of the parameters recorded after the burn-in period 



c 


95% CI for T 


95% CI for Y.J >-f 


95% CI for Y.J ^f 


2 
3 

4 

5 


(226.62,543.78) 
(230.93,543.30) 
(250.69,562.65) 
(244.67,548.41) 


(34,46) 
(37,49) 
(40,51) 
(41,51) 


(34,45) 
(38,48) 
(40,49) 
(42,51) 


a 


95% CI for T 


95% CI for E^ Af 


95% CI for Y.J ^f 


21 
31 

41 
51 


(102.53,315.95) 
(221.14,515.13) 
(344.68,770.30) 
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As the two molecules in the example run are structurally very similar, they 
can be aligned more closely if more atoms are included so that the cred- 
ibility interval for the precision parameter is shifted toward higher values 
as C increases. After a certain threshold, however, even larger values for the 
penalty parameter force the algorithm to include more atoms in the similar- 
ity calculations than desired and the precision decreases. Moreover, Table 3 
shows that — in terms of the number of included atoms — the algorithm is 
robust against changes of a. Also, as the posterior mean and variance of 
the precision parameter directly depend on a, the credibility intervals for r 
become wider and get shifted toward higher values as a increases. 

5.3. Chemical relevance. The pairwise distances which result from the 
930 superpositions can be regarded as chemically meaningful if they reflect 
the membership of the steroid molecules to the three activity classes, that 
is, if steroids within an activity class can be aligned more closely than those 
from different activity classes. In terms of our assumption about a common 
underlying reference field, such a result would indicate that there are actually 
three different reference fields which exhibit different small scale variations 
and hence different abilities to fit in to the protein binding pocket. 

We assess the chemical relevance of our results by performing two clus- 
ter analyses using Ward's (1963) method. To account for the asymmetry 
in our alignment method, the applied pairwise dissimilarity measures for 
two molecules A and B are thereby based on both the MCMC run which 
superimposes A on B and the MCMC run which superimposes B on A. 



In particular, we use Dmean(A,S) = a/^a^^b^b^a and Dmap{A,B) 



D^^qD^^j^, where the arrow denotes the direction of the superposi- 
tion, and "mean" and "MAP" indicate which type of (post burn-in) point 
estimate for the parameters is inserted into the Carbo dissimilarity mea- 
sure (11). 

Figure 6 shows the dendrograms resulting from the cluster analyses. It 
is notable that both distance measures lead to a very good separation of 
high and low activity steroids. In particular, the cluster analysis based on 
-Dmap(') is at the highest level able to separate these two activity classes 
completely. Overall, our distance can separate the activity classes as well 
as the distance which Dryden, Hirst and Melville (2007) found to have the 
highest separation power, and it clearly outperforms the other distances 
defined in their paper. 

The dendrograms indicate that it is plausible to assume that there are 
at least two different reference fields underlying the steric properties of the 
steroids. It is therefore of interest to determine these fields and examine 
where differences occur, as they could give rise to the different binding ac- 
tivities. In the following we will do so in a two-step procedure where our field 
GPA approach is first applied to all 31 steroids to obtain the overall optimal 
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Fig. 6. Cluster analysis using Ward's method: the left-hand side dendrogram is based 
on Z)inean(')! '^'^'^ ^^^ dendrogram on the right-hand side is calculated using -Dmap(-)- 
The labels correspond to the activity classes of the steroids (1 — high, 2 = intermediate, 
3= low). 

relative position of the molecules and then to the subgroups as defined by 
the activity classes which will provide the appropriate masks. 

5.4. Overall multiple alignment. When carrying out the overall optimal 
alignment of all 31 steroid molecules, the pairwise superpositions in step 1 of 
Algorithm 1 are performed as described before but with (^ = 2 to incorporate 
the knowledge that the reference molecule in all superpositions has a small 
number of atoms. The superpositions on the mean fields (step 7) are obtained 
using only the dissimilarities of the steric fields. As the initial molecular fields 
obtained in step 1 are good approximations of the fields which minimize the 
multiple Kernel Carbo index, we use a = 600 and /3 = 0.0001 to ensure that 
the full conditional distribution of the precision parameter has a large mean 
value at each iteration, and we reduce the standard deviations of the proposal 
distributions for the rigid body parameters to rji = 0.75 A and r}2 = 0.03°. 
Moreover, we set the number of iterations for each MCMC run in step 7 to 
500, and the tolerance value to tol = 0.0001. The algorithm is therefore used 
as a stochastic optimizer. 

The algorithm converges after the fourth field GPA iteration. Figure 7 
shows orthographic views of the resulting overlays, that is, projections of 
the three-dimensional data into the x-y and x-z Cartesian planes. The 
superposition after step 1 of the field GPA algorithm is displayed in the top 
row, and the bottom row shows the final overlay. For clarity, the random 
starting positions of the steroids are not displayed in this picture. 

5.5. Alignment within activity class subgroups. We now carry out the 
field GPA algorithm in subgroups of the data to allow for the possibility 
of different underlying multiple fields. Specifically, we consider the three 
activity classes of high, medium and low binding affinity to the receptor. The 
estimated mask vectors from each underlying field are then used together 
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Fig. 7. Top roiu; orthographic projections of the relative position of the 31 steroid 
molecules that results from step 1 of Algorithm 1. Bottom row: orthographic projections of 
the final relative position. The random starting positions of the molecules are not displayed. 



with the relative position of ah molecules obtained in the overall field GPA 
to calculate mean fields for each group. 

Figure 8 displays different cross sections of the mean field for each activity 
class. Light points thereby correspond to locations where the displayed steric 
field takes a large value, whereas dark points show field values close to zero. 

As expected, the observed differences are most pronounced between the 
mean fields of the high and the low activity group. To assess the differences 
for each pair {Ca, Cfe) of activity classes (a, 6 = 1, 2, 3; a 7^ 6) numerically, we 
consider a (two sample) i-field of the form 



(13) 



ta6(x) 



Za(x) - Zfo(x) 



^poolWW^a + l/nb' 



xG 



where n^ and n^ denote the number of molecules in activity class Ca and C^, 
respectively, Za{^) and ^^(x) denote the corresponding mean fields, and 
'^pooi(^) ~ ■*pooi(-^) + f^ is the pooled variance (with d = 0.001 a small offset 
to avoid spuriously large values in regions far away from the center) . For each 
pairwise comparison we define a three-dimensional grid G and calculate a t- 
value of the form (13) at a large number of points (142,598 here). Here we 
use (13) as an exploratory tool to see where the most pronounced differences 
occur. Figure 9 shows the regions in which the (absolute) t-field for each 
comparison exceeds a threshold of 8. A formal test which takes into account 
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Fig. 8. Cross sections of the mean steric fields of the three activity groups (left column: 
high activity, middle column: medium activity, right column: low activity). The different 
rows display cross sections at z = —1.5 (top row), 2 = (medium row) and z — 1.5 (bottom 
row). Light points correspond to locations with large value of the displayed field, whereas 
dark values show points with values close to zero. 



the multiple comparison problem and the spatial smoothness of the t-field 
could be applied using a threshold based on the excursion sets of Gaussian 
fields [e.g., Worsley (1994), Taylor and Worsley (2008)], which has been 
extensively used in fMRI studies. 

From both Figures 8 and 9 it can be seen that the main feature which 
distinguishes the high activity class from the other two classes is that the 
very active molecules commonly extend to the left of the ring structure 








Fig. 9. Thresholded t-fields resulting from pairwise comparisons of the steric mean fields 
of the three activity classes. Left-hand side: low vs. medium activity class, middle: low 
vs. high activity class, right-hand side: medium vs. high activity class. The shaded areas 
display regions where the t -field takes absolute values of larger than eight. 
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much more than the other molecules, where by ring structure we mean the 
carbon rings as shown in Figure 5. Prom the original data we can get the 
additional information that the associated atoms are oxygen and carbon 
atoms. Another interesting difference is located at the top right-hand side of 
the molecules where the low activity class differs from the other two classes 
in the location of oxygen atoms. These findings are in line with Figure 9 
in Dryden, Hirst and Melville (2007) and support the conjecture that the 
steric properties of the steroid molecules have a discriminating effect with 
respect to the binding affinity toward the CBG receptor. 

6. Discussion. Our methodology for aligning and comparing unlabeled 
marked point sets is based on spatial interpolation of the given marks and 
hence on a continuous representation of shape. The major advantages of 
our approach are that point correspondences do not need to be estimated 
and that the incorporated mask vectors automatically determine the similar 
regions of the considered point sets while ignoring the rest, which helps to 
reduce the level of noise in the alignment procedure. 

Our approach is related to a number of previously proposed methods. 
For example, it provides a probabilistic framework and generalization of the 
SEAL algorithm [Kearsley and Smith (1990)] which is well established in the 
field of rational drug design and essentially uses the L2-Carbo index together 
with a Gaussian covariance function. Our multiple alignment approach is 
related to the Bayesian model proposed by Dryden, Hirst and Melville (2007) 
which uses a similar concept but formulated only in terms of the point 
locations. Contrary to that, a hidden point configuration in the fully model- 
based Bayesian approach by Ruffieux and Green (2009) is integrated out and 
the multiple alignment of n point sets involves all 2" — n — 1 possible types 
of matches. The fact that our field-based approach naturally incorporates 
the additional information given by the marks is an additional difference 
to the previous approaches which is of particular advantage in the multiple 
alignment setting, as the resulting mean fields allow straightforward post- 
processing. 

In this paper we obtain the similarity index at the maximum a poste- 
riori (MAP) estimates of the rigid-body transformations and mask param- 
eters because this gives an approximation to the Kernel Carbo index (4). 
We could alternatively consider a full posterior analysis and work with the 
posterior distribution. A similar issue occurs in Bayesian shape analysis of 
unlabeled landmark configurations [Green and Mardia (2006), Dryden, Hirst 
and Melville (2007), Schmidler (2007)] where either a marginal approach (in- 
tegrating out nuisance parameters) or a conditional approach (conditioning 
at the MAP) could be used. We compared the two approaches for unlabeled 
landmarks in other work [Kenobi and Dryden (2010)] and the overall per- 
formance was similar in the situations considered. This can be explained 
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by the similarity of the marginal and conditional posteriors when a Laplace 
approximation is accurate (e.g., highly concentrated posterior distributions 
for the nuisance parameters). 

Finally, as molecules are fuzzy bodies of electronic clouds rather than 
discrete sets of atoms, our approach is particularly suited for the described 
application. However, as it does not require any predefined point-by-point 
correspondence, it could be an approach to resolve the alignment problem 
for a fairly broad range of applications. Examples include matching organs 
in medical images, matching objects in images of real- world scenes (e.g., 
faces) in photographs or clouds in satellite images. 
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SUPPLEMENTARY MATERIAL 

Supplement A: R programs for Bayesian molecule alignment 

(DOL 10.1214/11-AOAS486SUPPA; .zip). The zip file contains R programs 
for molecular alignment using random fields. The main R program is fieldsS.r 
which carries out a Bayesian MCMC procedure. The programs were writ- 
ten by Irina Czogiel, with some later edits by Ian Dryden. There are two 
options in the program — simulation study (as in Section 4.4) of the paper, 
or comparison of two molecules using steric information (as in Section 5). 

Supplement B: Steroids data (DOI: 10.1214/11-AOAS486SUPPB; .zip). 
The zip file contains the data set of steroids first analyzed by Dryden, Hirst 
and Melville (2007). The data set of {x,y,z) atom co-ordinates and partial 
charges was constructed by Jonathan Hirst and James Melville (School of 
Chemistry, University of Nottingham) . 
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